ABSTRACT

Measuring the morphological parameters of galaxies is a key requirement for studying 
their formation and evolution. Surveys such as the Sloan Digital Sky Survey (SDSS) 
have resulted in the availability of very large collections of images, which have
permitted population-wide analyses of galaxy morphology. Morphological analysis has
traditionally been carried out mostly via visual inspection by trained experts, which 
is time-consuming and does not scale to large (> 10^4) numbers of images.

Although attempts have been made to build automated classification systems, 
these have not been able to achieve the desired level of accuracy. The Galaxy Zoo 
project successfully applied a crowdsourcing strategy, inviting online users to classify 
images by answering a series of questions. Unfortunately, even this approach does not 
scale well enough to keep up with the increasing availability of galaxy images. 

We present a deep neural network model for galaxy morphology classification which 
exploits translational and rotational symmetry. It was developed in the context of the 
Galaxy Challenge, an international competition to build the best model for morphology
classification based on annotated images from the Galaxy Zoo project. 

For images with high agreement among the Galaxy Zoo participants, our model 
is able to reproduce their consensus with near-perfect accuracy (> 99%) for most 
questions. Confident model predictions are highly accurate, which makes the model
suitable for filtering large collections of images and forwarding challenging images to 
experts for manual annotation. This approach greatly reduces the experts’ workload 
without affecting accuracy. The application of these algorithms to larger sets of training 
data will be critical for analysing results from future surveys such as the LSST. 

Key words: methods: data analysis - catalogues - techniques: image processing - 
galaxies: general. 


1 INTRODUCTION 

Galaxies exhibit a wide variety of shapes, colours and sizes. These properties are
indicative of their age, formation conditions, and interactions with other galaxies over
the course of many Gyr. Studies of galaxy formation and evolution use morphology to probe
the physical processes that give rise to them. In particular, large, all-sky surveys of
galaxies are critical for disentangling the complicated relationships between parameters
such as halo mass, metallicity, environment, age, and morphology; deeper surveys probe the
changes in morphology starting at high redshifts and taking place over timescales of
billions of years.

Such studies require both the observation of large numbers of galaxies and accurate
classification of their morphologies. Large-scale surveys such as the Sloan Digital Sky
Survey (SDSS)^ have resulted in the availability of image data for millions of celestial
objects. However, manually inspecting all these images to annotate them with morphological
information is impractical for either individual astronomers or small teams.

Attempts to build automated classification systems for galaxy morphologies have
historically had difficulties in reaching the levels of reliability required for
scientific analysis (Clery 2011). The Galaxy Zoo project^ was conceived to accelerate this
task through the method of crowdsourcing. The original goal of the project was to obtain
reliable morphological classifications for ~900,000 galaxies by allowing members of the
public to contribute classifications via a web platform. The project was much more
successful than anticipated, with the entire catalog being annotated within a timespan of
several months (originally projected to take years). Since its inception, several
iterations of the project with different sets of images and more detailed classification
taxonomies have followed.

^ http://www.sdss.org/

^ http://www.galaxyzoo.org/

Two recent sets of developments since the launch of Galaxy Zoo have made an automated
approach more feasible: first, the large strides in the fields of image classification
and computer vision in general, primarily through the use of deep neural networks
(Krizhevsky et al. 2012; Razavian et al. 2014). Although neural networks have existed
for several decades (McCulloch & Pitts 1943; Fukushima 1980), they have recently returned
to the forefront of machine learning research. A significant increase in available
computing power, along with new techniques such as rectified linear units (Nair & Hinton
2010) and dropout regularization (Hinton et al. 2012; Srivastava et al. 2014), have made
it possible to build more powerful neural network models (see Section 5.1 for
descriptions of these techniques).

Secondly, large sets of reliably annotated images of galaxies are now available as a
consequence of the success of Galaxy Zoo. Such data can be used to train machine learning
models and increase the accuracy of their morphological classifications. Deep neural
networks in particular tend to scale very well as the number of available training
examples increases. Nevertheless, it is also possible to train deep neural networks on
more modestly sized datasets using techniques such as regularization, data augmentation,
parameter sharing and model averaging, which we discuss in Section 7.2 and following.

An automated approach is also becoming indispensable: modern telescopes continue to
collect more and more images every day. Future telescopes will vastly increase the number
of galaxy images that can be morphologically classified, including multi-wavelength
imaging, deeper fields, synoptic observing, and true all-sky coverage. As a result, the
crowdsourcing approach cannot be expected to scale indefinitely with the growing amount
of data. Supplementing both expert and crowdsourced catalogues with automated
classifications is a logical and necessary next step.

In this paper, we propose a convolutional neural network model for galaxy morphology
classification that is specifically tailored to the properties of images of galaxies. It
efficiently exploits both translational and rotational symmetry in the images, and
autonomously learns several levels of increasingly abstract representations of the images
that are suitable for classification. The model was developed in the context of the
Galaxy Challenge^, an international competition to build the best model for automatic
galaxy morphology classification based on a set of annotated images from the Galaxy Zoo 2
project. This model finished in first place out of 326 participants^. The model can
efficiently and automatically annotate catalogs of images with morphology information,
enabling quantitative studies of galaxy morphology on an unprecedented scale.

^ https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge

^ The model was independently developed by SD for the Kaggle competition. KWW co-designed
and administered the competition, but shared no data or code with any participants until
after the closing date.

The rest of this paper is structured as follows: we introduce the Galaxy Zoo project in
Section 2, and Section 3 explains the setup of the Galaxy Challenge. We discuss related
work in Section 4. Convolutional neural networks are described in Section 5, and our
method to incorporate rotation invariance in these models is described in Section 6. We
provide a complete overview of our modelling approach in Section 7 and report results in
Section 8. We analyse the model in Section 9. Finally, we draw conclusions in Section 10.


2 GALAXY ZOO 

Galaxy Zoo is an online crowdsourcing project where users 
are asked to describe the morphology of galaxies based on 
colour images (Lintott et al. 2008, 2011). Our model and analysis use data from the
Galaxy Zoo 2 iteration of the project, which uses colour images from the SDSS and a more
detailed classification scheme than the original project (Willett et al. 2013).
Participants are asked various questions
such as ‘How rounded is the galaxy?’ and ‘Does it have a 
central bulge?’, with the users’ answers determining which 
question will be asked next. The questions form a decision 
tree (Figure 1) which is designed to encompass all points 
in the traditional Hubble tuning fork as well as a range of 
more irregular morphologies. The classification scheme has 
11 questions and 37 answers in total (Table 1). 

Because of the structure of the decision tree, each individual participant answered
only a subset of the questions
for each classification. When many participants have classi¬ 
fied the same image, their answers are aggregated into a set 
of weighted vote fractions for the entire decision tree. These 
vote fractions are used to estimate confidence levels for each 
answer, and are indicative of the difficulty users experienced 
in classifying the image. More than half a million people have 
contributed classifications to Galaxy Zoo, with each image 
independently classified by 40 to 50 people^. 

Data from the Galaxy Zoo projects have already been used in a wide variety of studies on
galaxy structure, formation, and evolution (Skibba et al. 2009; Bamford et al. 2009;
Schawinski et al. 2009; Lintott et al. 2009; Darg et al. 2010; Masters et al. 2010, 2011;
Simmons et al. 2013; Melvin et al. 2014; Willett et al. 2015). Comparisons of Galaxy Zoo
morphologies to smaller samples from both experts and automated classifications show high
levels of agreement, testifying to the accuracy of the crowdsourced annotations (Bamford
et al. 2009; Willett et al. 2013).


3 THE GALAXY CHALLENGE 

Our model was developed in the context of the Galaxy Challenge, an online international
competition organized by Galaxy Zoo, sponsored by Winton Capital, and hosted on the
Kaggle platform for data prediction contests. It was held from December 20th, 2013 to
April 4th, 2014. The goal of the competition was to build a model that could predict
galaxy morphology from images like the ones that were used in the Galaxy Zoo 2 project.

^ Note that the vote fractions are post-processed to increase their reliability, for
example by weighting users based on their consistency with the majority, and by
compensating for classification bias induced by different image apparent magnitudes and
sizes (Willett et al. 2013).

Figure 1. The Galaxy Zoo 2 decision tree. Reproduced from Figure 1 in Willett et al. (2013).

Images of galaxies and morphological data for the competition were taken from the Galaxy
Zoo 2 main spectroscopic sample. Galaxies were selected to cover the full observed range
of morphology, colour, and size, since the goal was to develop a general algorithm that
could be applied to many types of images in future surveys. The total number of images
provided is limited both by the imaging depth of SDSS and the elimination of both
uncertain and over-represented morphological categories as a function of colour
(primarily red elliptical and blue spiral galaxies). This helped to ensure that colour is
not used as a proxy for morphology, and that a high-performing model would be based
purely on the images’ structural parameters.

The final training set of data consisted of 61,578 JPEG colour images of galaxies, along
with probabilities^ for each of the 37 answers in the decision tree. An evaluation set of
79,975 images was also provided, but with no morphological data; the goal of the
competition was to predict these values. Each image is 424 by 424 pixels in size. The
morphological data provided was a modified version of the weighted vote fractions in the
GZ2 catalog; these were transformed into “cumulative” probabilities that gave higher
weights to more fundamental morphological categories higher in the decision tree. Images
were anonymized from their SDSS IDs, with any use of metadata (such as colour, size,
position, or redshift) to train the algorithm explicitly forbidden by the competition
guidelines.

^ These are actually post-processed vote fractions obtained from the Galaxy Zoo
participants’ answers, but we treat them as probabilities in this paper.

Because the goal was to predict probabilities for each answer, as opposed to determining
the most likely answer for each question in the decision tree, the models built by
participants were actually solving a regression problem, and not a classification problem
in the strictest sense. The predictive performance of a model was determined by computing
the root-mean-square error (RMSE) between predictions on the evaluation set and the
corresponding crowdsourced probabilities. Let p_k be the answer probabilities associated
with an image (k = 1 ... 37), and \hat{p}_k the corresponding predictions. Then the RMSE
e(p_k, \hat{p}_k) can be computed as follows:

    e(p_k, \hat{p}_k) = \sqrt{\sum_{k=1}^{37} (p_k - \hat{p}_k)^2}.    (1)
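As an illustration, the error measure of Equation 1 can be sketched in a few lines of NumPy. The function names are hypothetical. The first function evaluates the expression for a single image; the second computes a conventional root-mean-square error over all images and answers, which is an assumption about how the per-image errors are aggregated and is not stated in the text.

```python
import numpy as np


def equation_1_error(p, p_hat):
    """Error for one image: root of the summed squared differences over the 37 answers."""
    p = np.asarray(p, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.sqrt(np.sum((p - p_hat) ** 2)))


def rmse(P, P_hat):
    """Root-mean-square error over all images and answers (assumed aggregation)."""
    P = np.asarray(P, dtype=float)
    P_hat = np.asarray(P_hat, dtype=float)
    return float(np.sqrt(np.mean((P - P_hat) ** 2)))


# Toy example: two images with 37 answer probabilities each (made-up values).
rng = np.random.default_rng(0)
targets = rng.random((2, 37))
predictions = np.clip(targets + 0.05 * rng.standard_normal((2, 37)), 0.0, 1.0)
print(equation_1_error(targets[0], predictions[0]), rmse(targets, predictions))
```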


This metric was chosen because it puts more emphasis on questions with higher answer
probabilities, i.e. the topmost questions in the decision tree, which correspond to
broader morphological categories.

Question                                       Answer                   Next
Q1   Is the galaxy simply smooth and rounded,  A1.1   smooth            Q7
     with no sign of a disk?                   A1.2   features or disk  Q2
                                               A1.3   star or artifact  end
Q2   Could this be a disk viewed edge-on?      A2.1   yes               Q9
                                               A2.2   no                Q3
Q3   Is there a sign of a bar feature through  A3.1   yes               Q4
     the centre of the galaxy?                 A3.2   no                Q4
Q4   Is there any sign of a spiral arm         A4.1   yes               Q10
     pattern?                                  A4.2   no                Q5
Q5   How prominent is the central bulge,       A5.1   no bulge          Q6
     compared with the rest of the galaxy?     A5.2   just noticeable   Q6
                                               A5.3   obvious           Q6
                                               A5.4   dominant          Q6
Q6   Is there anything odd?                    A6.1   yes               Q8
                                               A6.2   no                end
Q7   How rounded is it?                        A7.1   completely round  Q6
                                               A7.2   in between        Q6
                                               A7.3   cigar-shaped      Q6
Q8   Is the odd feature a ring, or is the      A8.1   ring              end
     galaxy disturbed or irregular?            A8.2   lens or arc       end
                                               A8.3   disturbed         end
                                               A8.4   irregular         end
                                               A8.5   other             end
                                               A8.6   merger            end
                                               A8.7   dust lane         end
Q9   Does the galaxy have a bulge at its       A9.1   rounded           Q6
     centre? If so, what shape?                A9.2   boxy              Q6
                                               A9.3   no bulge          Q6
Q10  How tightly wound do the spiral arms      A10.1  tight             Q11
     appear?                                   A10.2  medium            Q11
                                               A10.3  loose             Q11
Q11  How many spiral arms are there?           A11.1  1                 Q5
                                               A11.2  2                 Q5
                                               A11.3  3                 Q5
                                               A11.4  4                 Q5
                                               A11.5  more than four    Q5
                                               A11.6  can’t tell        Q5

Table 1. All questions that can be asked about an image, with the corresponding answers
that participants can choose from. Question 1 is the only one that is asked of every
image. The final column indicates the next question to be asked when a particular answer
is given. Reproduced from Table 2 in Willett et al. (2013).

The provided answer probabilities are derived from 
crowdsourced classifications, so they are somewhat noisy and 
biased in certain ways. As a result, the predictive models that 
were built exhibited some of the same biases. In other words, 
they are models of how the crowd would classify images of 
galaxies, which may not necessarily correspond to the “true” 
morphology. An example of such a discrepancy is discussed 
in Section 9. 

The models built by participants were evaluated as follows. The Kaggle platform
automatically computed two scores based on a set of model predictions: a public score,
computed on about 25% of the evaluation data, and a private score, computed on the other
75%. Public scores were immediately revealed during the course of the competition, but
private scores were not revealed until after the competition had finished. The private
score was used to determine the final ranking. Because the participants did not know which
evaluation images belonged to which set, they could not directly optimize the private
score, but were instead encouraged to build a predictive model that generalizes well to
new images.


4 RELATED WORK 

Machine learning techniques, and artificial neural networks in particular, have been a
popular tool in astronomy research for more than two decades. Neural networks were
initially applied for star-galaxy discrimination (Odewahn et al. 1992; Bertin 1994) and
classification of galaxy spectra (Folkes et al. 1996). More recently they have also been
used for photometric redshift estimation (Firth et al. 2003; Collister & Lahav 2004).

Galaxy morphology classification is one of the most widespread applications of neural
networks in astronomy. Most work in this domain proceeds by preprocessing the photometric
data and then extracting a limited set of handcrafted features that are known to be
discriminative, such as ellipticity, concentration, surface brightness, and radii and
log-likelihood values measured from various types of radial profiles (Storrie-Lombardi
et al. 1992; Lahav et al. 1995; Naim et al. 1995; Lahav et al. 1996; Ball et al. 2004;
Banerji et al. 2010). Support vector machines (SVMs) have also been applied in this
fashion (Huertas-Company et al. 2010).

Earlier work in this domain typically relied on much 
smaller datasets and used networks with very few trainable 
parameters (between 10^ and 10^). Modern network architectures
are capable of handling at least ~ 10^ parameters,
allowing for more precise fits and a larger morphological 
classification space. More recent work has profited from the 
availability of larger training sets using data from surveys 
such as the SDSS (Banerji et al. 2010; Huertas-Company 
et al. 2010). 

Another recent trend is the use of general purpose image features, instead of features
that are specific to galaxies: the WND-CHARM feature set (Orlov et al. 2008), originally
designed for biological image analysis, has been applied to galaxy morphology
classification in combination with nearest neighbour classifiers (Shamir 2009; Shamir
et al. 2013; Kuminski et al. 2014).

Other approaches to this problem attempt to forgo any 
form of handcrafted feature extraction by applying principal 
component analysis (PCA) to preprocessed images in combination
with a neural network (De La Calleja & Fuentes
2004), or by applying kernel SVMs directly to raw pixel data 
(Polsterer et al. 2012). 

Our approach differs from prior work in two main areas: 

• Most prior work uses handcrafted features (e.g., WND-CHARM) that required expert
knowledge and many hours of engineering to develop. We work directly with raw pixel data
and our use of convolutional neural networks allows for a set of task-specific features
to be learned from the data. The networks learn hierarchies of features, which allow them
to detect complex patterns in the images. Handcrafted features mostly rely on image
statistics and very local pattern detectors, making it harder to recognize complex
patterns. Furthermore, it is usually necessary to perform feature selection because the
handcrafted representations are highly redundant and many features are irrelevant for the
task at hand. Although many other participants in the Galaxy Challenge used convolutional
neural networks, there is little discussion of this approach in the astronomical
literature.

• Apart from the recent work of Kuminski et al. (2014), whose algorithms are also trained
on Galaxy Zoo data, most research has focused on classifying galaxies into a limited
number of classes (typically between 2 and 5), or predicting scalar values that are
indicative of galaxy morphology (e.g., Hubble T-types). Since the classifications made by
Galaxy Zoo users are much more fine-grained, the task the networks must solve is more
challenging. Since many outstanding astrophysical questions require more detailed
morphological data (such as the number and arrangements of clumps into proto-galaxies,
the relation between bar strength and central star formation, the link between merging
activity and active black holes, etc.), development of models that can handle these more
difficult tasks is crucial.

Our method for classifying galaxy morphology exploits the rotational symmetry of galaxy
images; however, there are other invariances and symmetries (besides translational) that
may be exploited for convolutional neural networks. Bruna et al. (2013) define
convolution operations over arbitrary graphs, generalizing from the typical grid of
pixels to other locally connected structures. Sifre & Mallat (2013) extract
representations that are invariant to affine transformations, based on scattering
transforms. However, these representations are fixed (i.e., not learned from data), and
not specifically tuned for the task at hand, unlike the representations learned by
convolutional neural networks.

Mairal et al. (2014) propose to train convolutional neural networks to approximate kernel
feature maps, allowing for the desired invariance properties to be encoded in the choice
of kernel, and subsequently learned. Gens & Domingos (2014) propose deep symmetry
networks, a generalization of convolutional neural networks with the ability to form
feature maps over any symmetry group, rather than just the translation group. Our
approach for exploiting rotational symmetry in the input images, described in Section 6,
is quite similar in spirit to this work. The major advantage to our implementation is a
demonstrably effective result at a reasonable computational cost.


5 BACKGROUND 
5.1 Deep learning 

The idea of deep learning is to build models that represent data at multiple levels of
abstraction, and can discover accurate representations autonomously from the data itself
(Bengio 2007). Deep learning models consist of several layers of processing that form a
hierarchy: each subsequent layer extracts a progressively more abstract representation of
the input data and builds upon the representation from the previous layer, typically by
computing a non-linear transformation of its input. The parameters of these
transformations are optimized by training the model on a dataset.

A feed-forward neural network is an example of such a model, where each layer consists of
a number of units (or neurons) that compute a weighted linear combination of the layer
input, followed by an elementwise non-linearity. These weights constitute the model
parameters. Let the vector x_{n-1} be the input to layer n, W_n be a matrix of weights,
and b_n be a vector of biases. Then the output of layer n can be represented as the vector

    x_n = f(W_n x_{n-1} + b_n),    (2)

where f is the activation function, an elementwise non-linear function. Common choices
for the activation function are linear rectification [f(x) = max(x, 0)], which gives rise
to rectified linear units (ReLUs; Nair & Hinton 2010), or a sigmoidal function
[f(x) = (1 + e^{-x})^{-1} or f(x) = tanh(x)]. Another possibility is to compute the
maximum across several linear combinations of the input, which gives rise to maxout units
(Goodfellow et al. 2013). We will consider a network with N layers. The network input is
represented by x_0, and its output by x_N.

Figure 2. Schematic representation of a feed-forward neural network with N layers. The
input x_0 passes through the hidden layers 1 to N - 1, with x_1 = f(W_1 x_0 + b_1) and
x_{N-1} = f(W_{N-1} x_{N-2} + b_{N-1}); the output layer N computes
x_N = f(W_N x_{N-1} + b_N).

A schematic representation of a feed-forward neural network is shown in Figure 2. The
network computes a function of the input x_0. The output x_N of this function is a
prediction of one or more quantities of interest. We will use t to represent the desired
output (target) corresponding to the network input x_0. The topmost layer of the network
is referred to as the output layer. All the other layers below it are hidden layers.
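A minimal sketch of the forward computation in Equation 2 is given below, assuming NumPy. The layer sizes, the random initialization and the function names are hypothetical choices made for illustration only; they are not the architecture used in this paper.

```python
import numpy as np


def relu(x):
    """Linear rectification, f(x) = max(x, 0)."""
    return np.maximum(x, 0.0)


def sigmoid(x):
    """Sigmoidal activation, f(x) = (1 + exp(-x))^-1."""
    return 1.0 / (1.0 + np.exp(-x))


def forward(x0, weights, biases, activation=relu):
    """Apply x_n = f(W_n x_{n-1} + b_n) for n = 1..N and return [x_0, ..., x_N]."""
    activations = [np.asarray(x0, dtype=float)]
    for W, b in zip(weights, biases):
        activations.append(activation(W @ activations[-1] + b))
    return activations


# Hypothetical network: 8 inputs, one hidden layer of 5 units, 3 outputs.
rng = np.random.default_rng(0)
sizes = [8, 5, 3]
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

xs = forward(rng.normal(size=sizes[0]), weights, biases)
print([x.shape for x in xs])
```

Passing `activation=sigmoid` or `activation=np.tanh` swaps in the other activation functions mentioned above without changing the rest of the computation.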

During training, the parameters of all layers of the network are jointly optimized to
make the output x_N approximate the desired output t as closely as possible. We quantify
the prediction error using an error measure e(x_N, t). As a result, the hidden layers
will learn to produce representations of the input data that are useful for the task at
hand, and the output layer will learn to predict the desired output from these
representations.

To determine how the parameters should be changed to reduce the prediction error across
the dataset, we use gradient descent: the gradient of e(x_N, t) is computed with respect
to the model parameters W_n, b_n for n = 1 ... N. The parameter values of each layer are
then modified by repeatedly taking small steps in the direction opposite to the gradient:

    W_n \leftarrow W_n - \eta \frac{\partial e(x_N, t)}{\partial W_n},    (3)

    b_n \leftarrow b_n - \eta \frac{\partial e(x_N, t)}{\partial b_n}.    (4)

Here, \eta is the learning rate, a hyperparameter controlling the step size.
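The update rules in Equations 3 and 4 can be illustrated with a deliberately small example. The sketch below assumes a single layer with an identity activation and a squared-error measure, so that the gradients can be written by hand; these choices, the toy data and the learning rate are assumptions made for illustration, not the training setup of this paper.

```python
import numpy as np


def error(W, b, x0, t):
    """Squared error e(x_1, t) for a single linear layer x_1 = W x_0 + b."""
    return float(np.sum((W @ x0 + b - t) ** 2))


def gradient_step(W, b, x0, t, eta):
    """One step of Equations 3 and 4 for the single linear layer above."""
    residual = W @ x0 + b - t
    grad_W = 2.0 * np.outer(residual, x0)  # de/dW
    grad_b = 2.0 * residual                # de/db
    return W - eta * grad_W, b - eta * grad_b


x0 = np.array([1.0, 2.0])   # toy input
t = np.array([1.0, -1.0])   # toy target
W, b = np.zeros((2, 2)), np.zeros(2)

initial_error = error(W, b, x0, t)
for _ in range(50):
    W, b = gradient_step(W, b, x0, t, eta=0.05)
final_error = error(W, b, x0, t)
print(initial_error, final_error)
```

For deeper networks the same update applies to every layer, with the gradients obtained by propagating the error backwards through the layers rather than by hand.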

Traditionally, models with many non-linear layers of processing have not been commonly
used because they were difficult to train: gradient information would vanish as it
propagated through the layers, making it difficult to learn the parameters of lower
layers (Hochreiter et al. 2001). Practical applications of neural networks were limited
to models with one or two hidden layers. Since 2006, the invention of several new
techniques, along with a significant increase in available computing power, has made this
task much more feasible.

Initially, unsupervised pre-training was proposed as a method to facilitate training
deeper networks (Hinton et al. 2006). Single-layer unsupervised models (such as
restricted Boltzmann machines or auto-encoders; Bengio 2007) are stacked on top of each
other and trained, and the learned parameters of these models are then used to initialize
the parameters of a deep neural network. These are then fine-tuned using standard
gradient descent. This initialization scheme makes it possible to largely avoid the
vanishing gradient problem. Nair & Hinton (2010) and Glorot et al. (2011) proposed the
use of rectified linear units (ReLUs) in deep neural networks. By replacing traditional
activation functions with linear rectification, the vanishing gradient problem was
significantly reduced. This also makes pre-training unnecessary in most cases.

The introduction of dropout regularization (Hinton et al. 2012; Srivastava et al. 2014)
has made it possible to train larger networks with many more parameters. Dropout is a
regularization method that can be applied to a layer n by randomly removing the output
values of the previous layer n - 1 (setting them to zero) with probability p. Typically
p is chosen to be 0.5. The remaining values are rescaled by a factor of (1 - p)^{-1} to
preserve the scale of the total input to each unit in layer n. For each training example
that is presented to the network, a different subset of values is removed. During
evaluation, no values are removed and no rescaling is performed.
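The dropout procedure described above can be sketched as follows, assuming NumPy. The function name and signature are hypothetical; the behaviour follows the description in the text: values are zeroed with probability p and the survivors are rescaled during training, while evaluation leaves the input untouched.

```python
import numpy as np


def dropout(x, p=0.5, rng=None, training=True):
    """Zero each value with probability p and rescale the rest by 1 / (1 - p)."""
    x = np.asarray(x, dtype=float)
    if not training:
        return x  # evaluation: nothing is removed and nothing is rescaled
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= p  # True with probability 1 - p
    return x * keep / (1.0 - p)


previous_layer_output = np.ones(1000)
dropped = dropout(previous_layer_output, p=0.5, rng=np.random.default_rng(0))
print(dropped[:10], dropped.mean())
```

Because the surviving values are scaled up during training, the expected total input to each unit in the next layer is the same in training and evaluation.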

Dropout is an effective regularizer because it prevents 
co-adaptation between units: each unit is forced to learn to
be useful by itself, because its utility cannot depend on the 
presence of other units in the same layer (as they can be 
removed at random). 


5.2 Convolutional neural networks 

Convolutional neural networks or convnets (Fukushima 1980; LeCun et al. 1998) are a
subclass of neural networks with constrained connectivity patterns between some of the
layers. They can be used when the input data exhibits some kind of topological structure,
like the ordering of image pixels in a grid, or the temporal structure of an audio
signal. Convolutional neural networks contain two types of layers with restricted
connectivity: convolutional layers and pooling layers. A convolutional layer takes a
stack of feature maps (e.g. the colour channels of an image) as input and convolves each
of these with a set of learnable filters to produce a stack of output feature maps. This
is efficiently implemented by replacing the matrix-vector product W_n x_{n-1} in
Equation 2 with a sum of convolutions. We represent the input of layer n as a set of K
matrices X_{n-1}^{(k)}, with k = 1 ... K. Each of these matrices represents a different
input feature map. The output feature maps X_n^{(l)}, l = 1 ... L, are represented as
follows:

    X_n^{(l)} = f\left( \sum_{k=1}^{K} W_n^{(k,l)} * X_{n-1}^{(k)} + b_n^{(l)} \right).    (5)

Here, * represents the two-dimensional convolution operation, the matrices W_n^{(k,l)}
represent the filters of layer n, and b_n^{(l)} represents the bias for feature map l.
Note that a feature map X_n^{(l)} is obtained by computing a sum of K convolutions with
the feature maps of the previous layer. The bias b_n^{(l)} can optionally be replaced by
a matrix B_n^{(l)}, so that each spatial position in the feature map has its own bias
(‘untied’ biases). This allows the sensitivity of the filters to vary across the input.
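The sum of convolutions that produces each output feature map can be written out directly. The sketch below uses plain NumPy loops for clarity rather than speed. The array layout, the 'valid' border handling (no padding), the ReLU default and all names are assumptions made for illustration; practical implementations rely on optimized library routines.

```python
import numpy as np


def convolve2d_valid(x, w):
    """Two-dimensional convolution (flipped filter) without padding."""
    kh, kw = w.shape
    flipped = w[::-1, ::-1]
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * flipped)
    return out


def conv_layer(X_prev, W, b, f=lambda z: np.maximum(z, 0.0)):
    """Compute L output maps from K input maps.

    X_prev: (K, H, W) input feature maps
    W:      (K, L, kh, kw) filters, one per (input map, output map) pair
    b:      (L,) one tied bias per output feature map
    """
    K, L = W.shape[:2]
    maps = []
    for l in range(L):
        # Sum of K convolutions, one with each feature map of the previous layer.
        z = sum(convolve2d_valid(X_prev[k], W[k, l]) for k in range(K))
        maps.append(f(z + b[l]))
    return np.stack(maps)


rng = np.random.default_rng(0)
X_prev = rng.normal(size=(3, 8, 8))      # e.g. three colour channels
filters = rng.normal(size=(3, 4, 3, 3))  # K = 3 inputs, L = 4 outputs, 3x3 filters
X_next = conv_layer(X_prev, filters, np.zeros(4))
print(X_next.shape)
```

Untied biases would correspond to replacing the scalar `b[l]` with an array of the same shape as `z`.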

By replacing the matrix product with a sum of convolutions, the connectivity of the layer
is effectively restricted to take advantage of the input structure and to reduce the
number of parameters. Each unit is only connected to a local subset of the units in the
layer below, and each unit is replicated across the entire input. This is shown in the
left side of Figure 3. This means that each unit can be seen as detecting a particular
feature across the input (for example, an oriented edge in an image). Applying feature
detectors across the entire input enables the exploitation of translational symmetry in
images.

As a consequence of this restricted connectivity pattern, convolutional layers typically
have far fewer parameters than traditional dense (or fully-connected) layers that compute
a transformation of their input according to Equation 2. This reduction in parameters can
drastically improve generalization performance (i.e., predictive performance on unseen
examples) and make the model scale to larger input dimensionalities.
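To make the difference in parameter counts concrete, the following sketch compares a dense layer and a convolutional layer that map the same input to the same output size. The input size, number of feature maps and filter size are hypothetical values chosen for illustration, and convolution without padding is assumed.

```python
def dense_layer_parameters(n_inputs, n_outputs):
    """One weight per (input, output) pair plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


def conv_layer_parameters(n_input_maps, n_output_maps, filter_height, filter_width):
    """One filter per (input map, output map) pair plus one tied bias per output map."""
    return n_input_maps * n_output_maps * filter_height * filter_width + n_output_maps


# Hypothetical layer: 3 input maps of 64x64 pixels, 32 output maps, 5x5 filters.
input_maps, height, width = 3, 64, 64
output_maps, filter_size = 32, 5
out_height = height - filter_size + 1  # output size without padding
out_width = width - filter_size + 1

dense = dense_layer_parameters(input_maps * height * width,
                               output_maps * out_height * out_width)
conv = conv_layer_parameters(input_maps, output_maps, filter_size, filter_size)
print(f"dense: {dense:,} parameters, convolutional: {conv:,} parameters")
```

In this example the convolutional layer needs a few thousand parameters where the equivalent dense layer would need more than a billion.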

Because convolutional layers are only able to model local correlations in the input, the
dimensionality of the feature maps is often reduced between convolutional layers by
inserting pooling layers. This allows higher layers to model correlations across a larger
part of the input, albeit with a lower resolution. A pooling layer reduces the
dimensionality of a feature map by computing some aggregation function (typically the
maximum or the mean) across small local regions of the input (Boureau et al. 2010), as
shown in the right side of Figure 3. This also makes the model invariant to small
translations of the input, which is a desirable property for modelling images and many
other types of data. Unlike convolutional layers, pooling layers typically do not have
any trainable parameters.
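A pooling layer of this kind can be sketched in a few lines of NumPy. The sketch assumes non-overlapping square pooling regions and discards any rows or columns that do not fill a complete region; both are illustrative assumptions, and the function name is hypothetical.

```python
import numpy as np


def pool2d(feature_map, size=2, reduce=np.max):
    """Aggregate non-overlapping size x size regions of a single feature map."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Crop to a multiple of the pooling size, then split into blocks.
    blocks = feature_map[:h_out * size, :w_out * size].reshape(h_out, size, w_out, size)
    return reduce(blocks, axis=(1, 3))


x = np.arange(16.0).reshape(4, 4)
max_pooled = pool2d(x, size=2, reduce=np.max)
mean_pooled = pool2d(x, size=2, reduce=np.mean)
print(max_pooled)
print(mean_pooled)
```

The function has no trainable parameters: its output depends only on the input and on the fixed choice of region size and aggregation function.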

By alternating convolutional and pooling layers, higher layers in the network see a
progressively more coarse representation of the input. As a result, these layers are able
to model higher-level abstractions more easily because each unit is able to see a larger
part of the input.

Figure 3. A schematic overview of a convolutional layer followed by a pooling layer: each
unit in the convolutional layer is connected to a local neighborhood in all feature maps
of the previous layer. The pooling layer aggregates groups of neighboring units from the
layer below.

Convolutional neural networks constitute the state of the art in
many computer vision problems. Since their effectiveness for
large-scale image classification was demonstrated, they have been
ubiquitous in computer vision research (Krizhevsky et al. 2012;
Razavian et al. 2014; Szegedy et al. 2014; Simonyan & Zisserman
2014).


6 EXPLOITING ROTATIONAL SYMMETRY 

The restricted connectivity patterns used in convolutional
neural networks drastically reduce the number of parameters
required to model large images, by exploiting translational
symmetry. However, there are many other types of symmetries that
occur in images. For images of galaxies, rotating an image
should not affect its morphological classification. This
rotational symmetry is exploited by applying the same set of
feature detectors to various rotated versions of the input. This
further increases parameter sharing, which has a positive effect
on generalization performance.

Whereas convolutions provide an efficient way to exploit
translational symmetry, applying the same filter to multiple
rotated versions of the input requires explicitly instantiating
these versions. Additionally, rotating an image by an angle that
is not a multiple of 90° requires interpolation and results in
an image whose edges are not aligned with the rows and columns
of the pixel grid. These complications make exploiting
rotational symmetry more challenging.

We note that the original Galaxy Zoo project experimented with
crowdsourced classifications of galaxies in which the images
were both vertically and diagonally mirrored. Land et al. (2008)
showed that the raw votes had an excess of 2.5% for S-wise
(anticlockwise) spiral galaxies over Z-wise (clockwise)
galaxies. Since this effect was seen in both the raw and
mirrored images, it was interpreted as a bias due to preferences
in the human brain, rather than as a true excess in the number
of apparent S-wise spirals in the Universe. The existence of
such a directional bias in the brain was demonstrated by Gori et
al. (2006). The Galaxy Zoo 2 probabilities do not contain any
structures related to handedness or rotation-variant quantities,
and no rotational or translational biases have yet been
discovered in the data. If such biases do exist, however, this
would presumably reduce the predictive power of the model, since
the assumption of rotational invariance of the output
probabilities would no longer apply.

Our approach for exploiting symmetry is visualized in Figure 4.
We compute rotated and flipped versions of the input images,
which are referred to as viewpoints, and process each of these
separately with the same convolutional network architecture,
consisting of alternating convolutional layers and pooling
layers. The output feature maps of this network for the
different viewpoints are then concatenated, and one or more
dense layers are stacked on top. This arrangement allows the
dense layers to aggregate high-level features extracted from
different viewpoints.

In practice, we also crop the top left part of each viewpoint
image both to reduce redundancy between the viewpoints and to
reduce the size of the input images (and hence computation
time). Images are cropped in such a way that each one contains
the centre of the galaxy, because this part of the image tends
to be very informative. The practical implementation of
viewpoint extraction is discussed in Section 7.5, and the
modified network architecture is described in Section 7.6.


7 APPROACH 

In this section, we describe our practical approach to
developing and training a model for galaxy morphology
prediction. We first discuss our experimental setup and the
problem of overfitting, which was the main driver behind our
design decisions. We then describe the successive steps in our
processing pipeline to obtain a set of answer probabilities from
an image. This pipeline consists of five steps (Figure 5): input
preprocessing, augmentation, viewpoint extraction, a
convolutional neural network and model averaging. We also
briefly discuss the practical implementation of the pipeline
from a software perspective.

[Figure 4 panel labels: 1. input; 2. rotate; 3. crop; 4. convolutions; 5. dense; 6. predictions]

Figure 4. Schematic overview of a neural network architecture for exploiting rotational symmetry. The input image (1) is first rotated to
various angles and optionally flipped to yield different viewpoints (2), and the viewpoints are subsequently cropped to reduce redundancy
(3). Each of the cropped viewpoints is processed by the same stack of convolutional layers and pooling layers (4), and their output
representations are concatenated and processed by a stack of dense layers (5) to obtain predictions (6).

Figure 5. Schematic overview of the processing pipeline.

7.1 Experimental setup

As described in Section 3, the provided dataset consists of a
training set with 61,578 images with associated answer
probabilities, and an evaluation set of 79,975 images. Feedback
could be obtained during the competition by submitting
predictions for the images in the evaluation set. During the
competition, submitted predictions were scored by computing the
RMSE on a subset of approximately 25% of the evaluation images.
It was not revealed which images were part of this subset. The
scores used to determine the final ranking were obtained by
computing the RMSE on the remaining 75% of images. This
arrangement is typical for competitions hosted on the Kaggle
platform. We split off a further 10% of the training set images
for real-time evaluation during model training, and trained our
models only on the remaining 90%.
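
The evaluation setup above can be sketched as follows. This is an
illustrative sketch rather than the competition's scoring code: it
assumes the RMSE is taken over all images and all answer
probabilities jointly, and the helper names rmse and
split_train_valid, as well as the random seed, are hypothetical.

```python
import numpy as np


def rmse(predicted, target):
    """Root-mean-square error over all images and answers jointly."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((predicted - target) ** 2)))


def split_train_valid(n_images, valid_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the training images."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)
    n_valid = int(round(n_images * valid_fraction))
    return order[n_valid:], order[:n_valid]


# 61,578 training images, of which 10% are held out for
# real-time evaluation during training.
train_idx, valid_idx = split_train_valid(61578)
```

With these numbers the held-out set contains 6158 images, in line
with the real-time evaluation set of over 6,000 images mentioned
in Section 8.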

7.2 Avoiding overfitting 

Modern neural networks typically have a large number of
learnable parameters (several million in the case of our model).
This is in stark contrast with the limited size of the training
set, which had only 5 x 10^4 images. As a result, there is a
high risk of overfitting: a network will tend to memorize the
training examples because it has enough capacity to do so, and
will not generalize well to new data. We used several strategies
to avoid overfitting:

• data augmentation: extending the training set by randomly
perturbing images in a way that leaves their associated answer
probabilities unchanged;

• regularization: penalizing model complexity through use of
dropout (Hinton et al. 2012);

• parameter sharing: reducing the number of model parameters by
exploiting translational and rotational symmetry in the input
images;

• model averaging: averaging the predictions of several
models.


7.3 Preprocessing 

Images are first cropped and rescaled to reduce the
dimensionality of the input. It was useful to crop the images
because the object of interest is in the middle of the image
with a large amount of sky background, and typically fits within
a square with a side of approximately half the image height. We
then rescaled the images to speed up training, with little to no
effect on predictive performance. Images were cropped from
424 x 424 pixels to 207 x 207, and then downscaled by a factor
of 3 to 69 x 69 pixels.
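
The crop and rescale step can be sketched in NumPy as below. This
is only an approximation under stated assumptions: the paper
performs the rescaling as part of an affine transformation (see
Section 7.4), whereas this sketch downscales by averaging 3 x 3
pixel blocks. The function names and the random stand-in image
are hypothetical.

```python
import numpy as np


def centre_crop(image, size):
    """Cut a size x size square from the middle of an (H, W, C) image."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]


def downscale_by_block_mean(image, factor):
    """Downscale an (H, W, C) image by averaging factor x factor blocks.

    This is a stand-in for interpolation-based rescaling.
    """
    h, w, c = image.shape
    assert h % factor == 0 and w % factor == 0
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


# Stand-in for a 424 x 424 RGB image with values in [0, 1].
raw = np.random.default_rng(0).random((424, 424, 3))
small = downscale_by_block_mean(centre_crop(raw, 207), 3)
```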

For a small subset of the images, the cropping operation removed
part of the object of interest, either because it had an
unusually large angular size or because it was not perfectly
centred. We looked into recentering and rescaling the images by
detecting and measuring the objects in the images using
SExtractor (Bertin & Arnouts 1996). This allowed us to
independently estimate both the position and Petrosian radii of
the objects. This information was then used to centre and
rescale all images to standardize the sizes of the objects
before further processing.

This normalization step had no significant effect on the
predictive performance of our models. Nevertheless, we did train
a few models using this approach, because even though they
achieved the same performance in terms of RMSE compared to
models trained without it, the models make different mistakes.
This is useful in the context of model averaging (Section 7.8),
where high variance among a set of comparably performing models
is desirable (Bishop 2006).

The images for the competition were provided in the same format
that is used on the Galaxy Zoo website (424 x 424 JPEG colour
images). We found that keeping the colour information (instead
of converting the images to grayscale) improved the predictive
performance considerably, despite the fact that the colours are
artificial and intended for human eyes. These artificial colours
are nevertheless correlated with morphology, and our models are
able to exploit this correlation.


7.4 Data augmentation 

Due to the limited size of the training set, performing data
augmentation to artificially increase the number of training
examples is instrumental. Each training example was randomly
perturbed in five ways, which are shown in Figure 6:

• rotation: random rotation with an angle sampled uniformly
between 0° and 360°, to exploit rotational symmetry in the
images.

• translation: random shift sampled uniformly between -4 and 4
pixels (relative to the original image size of 424 by 424) in
the x and y direction. The size of the shift is limited to
ensure that the object of interest remains in the centre of the
image.

• scaling: random rescaling with a scale factor sampled
log-uniformly between 1.3^-1 and 1.3.

• flipping: the image is flipped with a probability of 0.5.

• brightness adjustment: the colour of the image is 
adjusted as described by Krizhevsky et al. (2012), with two 
differences: the first eigenvector has a much larger eigenvalue 
than the other two, so only this one is used, and the standard 
deviation for the scale factor is set to a = 0.5. In practice, 
this amounts to a brightness adjustment. 

The first four of these are affine transformations, which can be
collapsed into a single transformation together with the one
used for preprocessing. This means that the data augmentation
step has no noticeable computational cost. To maximize the
effect of data augmentation, we randomly perturbed the images on
demand during training, so the models were never presented with
the exact same training example more than once.
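
To make the idea of collapsing the affine perturbations into a
single transformation concrete, the sketch below samples the four
affine parameters with the ranges listed above and composes them
into one 3 x 3 homogeneous matrix. The composition order, the
choice of the image centre as pivot and the function names are
assumptions made for illustration; the brightness adjustment is
not affine and is left out.

```python
import numpy as np


def sample_augmentation(rng):
    """Draw one random set of affine perturbation parameters."""
    return {
        "angle_deg": rng.uniform(0.0, 360.0),
        "shift_xy": rng.uniform(-4.0, 4.0, size=2),
        # Log-uniform between 1.3^-1 and 1.3.
        "scale": float(np.exp(rng.uniform(np.log(1 / 1.3), np.log(1.3)))),
        "flip": bool(rng.random() < 0.5),
    }


def affine_matrix(params, centre):
    """Compose flip, scale, rotation and shift about `centre`."""
    cx, cy = centre
    theta = np.deg2rad(params["angle_deg"])
    c, s = np.cos(theta), np.sin(theta)
    tx, ty = params["shift_xy"]
    to_origin = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1]], dtype=float)
    flip = np.diag([-1.0 if params["flip"] else 1.0, 1.0, 1.0])
    scale = np.diag([params["scale"], params["scale"], 1.0])
    rotate = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)
    back = np.array([[1, 0, cx + tx], [0, 1, cy + ty], [0, 0, 1]], dtype=float)
    # A single matrix product replaces four separate image operations.
    return back @ rotate @ scale @ flip @ to_origin


rng = np.random.default_rng(0)
params = sample_augmentation(rng)
M = affine_matrix(params, centre=(211.5, 211.5))
```

Because the result is a single matrix, it could in turn be
multiplied with the preprocessing transformation so that each
image is resampled only once.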


7.5 Viewpoint extraction 

After preprocessing and augmentation, we extracted viewpoints by
rotating, flipping and cropping the input images. We extracted
16 different viewpoints for each image: first, two square-shaped
crops were extracted from an input image, one at 0° and one at
45°. Both were also flipped horizontally to obtain 4 crops in
total. Each of these crops is 69 x 69 pixels in size. Then, four
overlapping corner patches of 45 x 45 pixels were extracted from
each crop, and rotated so that the centre of the galaxy is in
the bottom right corner of each patch. These 16 rotated patches
constitute the viewpoints (Figure 7).

[Figure 6 panels: (a) none; (b) rotation; (c) translation; (d) scaling; (e) flipping; (f) brightness]

Figure 6. The five types of random data augmentation used in this model. Note that the effect of translation and brightness adjustment
is fairly subtle.

[Figure 7 panels: (a) 4 crops from an image; (b) 4 viewpoints from each crop]

Figure 7. Obtaining 16 viewpoints from an input image. (a) First,
two square-shaped crops are extracted from the image, one at
0° (red outline) and one at 45° (blue outline). Both are also flipped
horizontally to obtain 4 crops in total. (b) Then, four overlapping
corner patches are extracted from each crop, and they are rotated
so that the galaxy centre is in the bottom right corner of each
patch. These 16 rotated patches constitute the viewpoints. This
figure is best viewed in colour.

This approach allowed us to obtain 16 different viewpoints with
just two affine transformation operations, thus avoiding
additional computation. All viewpoints can be obtained from the
two original crops without interpolation, using only array
indexing operations. This also means that image edges and
padding have no effect on the input, and that the loss of image
fidelity after preprocessing, augmentation and viewpoint
extraction is minimal.
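
The viewpoint extraction described above can be sketched with
array indexing and np.rot90 alone. The sketch assumes the two
69 x 69 crops (at 0° and 45°) have already been produced by the
affine transformation step; the function name extract_viewpoints
and the synthetic test image are ours and are not taken from the
paper's code.

```python
import numpy as np


def extract_viewpoints(crop_0, crop_45, patch=45):
    """Return 16 viewpoints of shape (patch, patch, C) from two crops.

    Each crop and its horizontal flip yield four overlapping corner
    patches, rotated by multiples of 90 degrees so that the crop
    centre ends up in the bottom right part of every patch.
    """
    crops = [crop_0, crop_45, crop_0[:, ::-1], crop_45[:, ::-1]]
    views = []
    for crop in crops:
        top_left = crop[:patch, :patch]
        top_right = crop[:patch, -patch:]
        bottom_right = crop[-patch:, -patch:]
        bottom_left = crop[-patch:, :patch]
        views.append(top_left)                    # centre already bottom right
        views.append(np.rot90(top_right, 1))      # bottom left -> bottom right
        views.append(np.rot90(bottom_right, 2))   # top left -> bottom right
        views.append(np.rot90(bottom_left, 3))    # top right -> bottom right
    return np.stack(views)


# Synthetic crop with a single bright pixel marking the galaxy centre.
crop = np.zeros((69, 69, 3))
crop[34, 34, :] = 1.0
views = extract_viewpoints(crop, crop)
```

Slicing, flipping and rotating by multiples of 90° only reorder
existing pixels, so no interpolation is involved, in line with the
remark above.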

7.6 Network architecture 

All viewpoints were presented to the network as 45 by 45 by 3
arrays of RGB values, scaled to the interval [0,1], and
processed by the same convolutional architecture. The resulting
feature maps were then concatenated and processed by a stack of
three fully connected layers to map them to the 37 answer
probabilities.

The architecture for the best performing network is visualized
in Figure 8. There are four convolutional layers, all with
square filters, with filter sizes 6, 5, 3 and 3 respectively,
and with untied biases (i.e. each spatial position in each
feature map has a separate bias, see Section 5.2). The
rectification non-linearity is applied after each layer (Nair &
Hinton 2010). Max-pooling over 2 by 2 regions follows the first,
second and fourth convolutional layers. The concatenated feature
maps from all viewpoints are processed by a stack of three fully
connected layers, consisting of two maxout layers (Goodfellow et
al. 2013) with 2048 units with two linear filters each, and a
linear layer that outputs 37 real numbers. Maxout layers were
used instead of ReLU layers to reduce the number of connections
to the next layer (and thus the number of parameters). We did
not use maxout in the convolutional layers because it proved too
computationally intensive.

We arrived at this particular architecture after a manual
parameter search: more than 100 architectures were evaluated
over the course of the competition, and this one was found to
yield the best predictive performance. The network has roughly
42 million trainable parameters in total. Table 2 lists the
hyperparameter settings for the trainable layers.
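
As a rough consistency check on the figures quoted above, the
following sketch traces the feature map sizes through the
convolutional part and counts the weights. It assumes unpadded
("valid") convolutions with stride 1 and non-overlapping 2 x 2
pooling, which the text does not state explicitly, and it ignores
biases; the variable names are ours.

```python
def conv_out(size, filter_size):
    """Output width of an unpadded, stride-1 convolution (assumed)."""
    return size - filter_size + 1


def pool_out(size, pool=2):
    """Output width of non-overlapping pooling."""
    return size // pool


# (number of filters, filter size, followed by 2x2 max-pooling?)
conv_layers = [(32, 6, True), (64, 5, True), (128, 3, False), (128, 3, True)]

size, channels, conv_weights = 45, 3, 0
for n_filters, filter_size, pooled in conv_layers:
    conv_weights += n_filters * channels * filter_size * filter_size
    size = conv_out(size, filter_size)
    if pooled:
        size = pool_out(size)
    channels = n_filters

features_per_view = channels * size * size   # per viewpoint
concatenated = 16 * features_per_view        # all 16 viewpoints

# Two maxout layers (2048 units, 2 linear filters each) and a linear output.
dense_weights = concatenated * 2048 * 2 + 2048 * 2048 * 2 + 2048 * 37
total_weights = conv_weights + dense_weights
```

Under these assumptions the spatial size shrinks from 45 to 2, the
dense layers receive 8192 inputs, and the weight count comes to
about 42.3 million, which is consistent with the roughly 42
million parameters reported. Almost all of them sit in the first
dense layer.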

The 37 values that the network produces for an input image are
converted into a set of probabilities. First, the values are
passed through a rectification non-linearity, and then
normalized per question to obtain a valid categorical
probability distribution for each question. Valid probability
distributions could also be obtained by using a softmax function
per question, instead of rectification followed by
normalization. However, this decreased the overall performance
since it was harder for the network to predict a probability of
exactly 0 or 1.

The distributions still need to be rescaled, however; they give
the probability of an answer conditional on its associated
question being asked, but each user is only asked a subset of
the questions. This implies that some questions have a lower
probability of being asked, so the probabilities of the answers
to these questions should be scaled down to obtain unconditional
probabilities. In practice we scale them by the probabilities of
the answers that preceded them in the decision tree (see
Figure 1).
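
The two post-processing steps (rectification with per-question
normalization, followed by scaling with the parent answer
probability) can be sketched on a toy decision tree. The tree
below contains only two questions and is purely illustrative: the
slices, the parent mapping and the small constant eps guarding
against division by zero are assumptions, not the 37-output
layout used in the paper.

```python
import numpy as np

# Toy layout: Q1 has 3 answers, Q2 has 2 answers.
QUESTION_SLICES = {"Q1": slice(0, 3), "Q2": slice(3, 5)}
# Q2 is only asked when the second answer of Q1 (index 1) is chosen.
PARENT_ANSWER = {"Q2": 1}


def to_probabilities(raw, eps=1e-12):
    """Map raw network outputs to unconditional answer probabilities."""
    rectified = np.maximum(np.asarray(raw, dtype=float), 0.0)
    probs = np.zeros_like(rectified)
    # Step 1: normalize per question (conditional probabilities).
    for sl in QUESTION_SLICES.values():
        probs[sl] = rectified[sl] / (rectified[sl].sum() + eps)
    # Step 2: scale child questions by the probability of the answer
    # that leads to them (parents must be processed before children).
    for name, parent_idx in PARENT_ANSWER.items():
        probs[QUESTION_SLICES[name]] *= probs[parent_idx]
    return probs


probs = to_probabilities([2.0, 6.0, -1.0, 3.0, 1.0])
```

In this example the answers to Q2 sum to the probability of the Q1
answer that leads to Q2, which is the kind of constraint the
scaling is meant to enforce. Unlike a softmax, the rectification
also allows an output of exactly 0.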

This post-processing operation is incorporated into the network.
Because it consists only of differentiable operations*, the
gradient of the objective function can be backpropagated through
it. This guarantees that the output of the network will not
violate the constraints that the answer probabilities must
adhere to (for example, p_bar must be greater than or equal to
p_spiral in the cumulative probabilities, since it is a
higher-level question in the decision tree). This resulted in a
small but significant performance improvement.

* Although the rectification operation is not technically
differentiable everywhere, it is subdifferentiable so this does
not pose a problem in practice.

Figure 8. Schematic overview of the architecture of the best performing network that we trained. The sizes of the filters and feature
maps are indicated for each layer.

layer  type           # features  filter size  non-linearity  initial biases  initial weights
1      convolutional  32          6x6          ReLU           0.1             N(0, 0.01)
2      convolutional  64          5x5          ReLU           0.1             N(0, 0.01)
3      convolutional  128         3x3          ReLU           0.1             N(0, 0.01)
4      convolutional  128         3x3          ReLU           0.1             N(0, 0.1)
5      dense          2048        -            maxout (2)     0.01            N(0, 0.001)
6      dense          2048        -            maxout (2)     0.01            N(0, 0.001)
7      dense          37          -            constraints    0.1             N(0, 0.01)

Table 2. The hyperparameters of the trainable layers of the best performing network that we trained, also depicted in Figure 8. The
last two columns describe the initialization distributions of the weights and biases of each layer. See Section 7.6 for a description of the
incorporation of the output constraints into the last layer of the network.

In addition to the best performing network, we also trained
variants for the purpose of model averaging (see Section 7.8).
These networks differ slightly from the best performing network,
and make slightly different predictions as a result. Variants
included:

• a network with only two dense layers instead of three;

• a network with a different filter size configuration (filter
sizes 8, 4, 3, 3 respectively instead of 6, 5, 3, 3);

• a network with ReLUs in the dense layers instead of
maxout units;

• a network with 256 filters instead of 128 in the topmost
convolutional layer.

In total, 17 different networks were trained on this data set.


7.7 Training 

To train the models we used minibatch gradient descent with a
batch size* of 16 and Nesterov momentum (Bengio et al. 2013)
with coefficient μ = 0.9. Nesterov momentum is a method for
accelerating gradient descent by accumulating gradients over
time in directions that consistently decrease the objective
function value. This and similar methods are commonly used in
neural network training because they speed up the training
process and often lead to improved predictive performance
(Sutskever et al. 2013).

* The batch size chosen is small because the convolutional part
of the network is applied 16 times to different viewpoints of the
input images, yielding an effective batch size of 256.

We performed approximately 1.5 million gradient updates,
corresponding to 25 million training examples. Following
Krizhevsky et al. (2012), we used a discrete learning rate
schedule to improve convergence. We began with a constant
learning rate η = 0.04 and decreased it tenfold twice: it was
decreased to 0.004 after 18 million examples, and to 0.0004
after 23 million examples. For the first 10,000 examples, the
output constraints were ignored, and the linear output of the
top layer of the network was simply clipped between 0 and 1.
This was necessary to ensure convergence.

Weights in the model were initialized by sampling from zero-mean
normal distributions (Bengio 2012). The variances of these
distributions were fixed at each layer, and were manually chosen
to ensure proper flow of the gradient through the network. All
biases were initialized to positive values to decrease the risk
of units getting stuck in the saturation region. Although this
is not necessary for maxout units, the same strategy was used
for the dense layers. The initialization strategy for all layers
is shown in the last two columns of Table 2.

During training, we used dropout (Hinton et al. 2012) in 
all three dense layers. Using dropout was essential to reduce 
overfitting to manageable levels. 
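
The training procedure can be illustrated with a short sketch of
the learning rate schedule and a Nesterov momentum update. The
schedule values are those given above; the update rule follows
one common formulation of Nesterov momentum (gradient evaluated
at the look-ahead point) and is applied here to a toy
one-dimensional quadratic, so it should be read as an assumption
about the general technique rather than as the paper's Theano
implementation.

```python
def learning_rate(examples_seen):
    """Discrete schedule: tenfold decreases after 18M and 23M examples."""
    if examples_seen < 18_000_000:
        return 0.04
    if examples_seen < 23_000_000:
        return 0.004
    return 0.0004


def nesterov_step(theta, velocity, grad_fn, lr, momentum=0.9):
    """One Nesterov momentum update (look-ahead gradient formulation)."""
    lookahead = theta + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return theta + velocity, velocity


# Toy objective f(theta) = 0.5 * theta^2, whose gradient is theta.
batch_size = 16
theta, velocity = 5.0, 0.0
for step in range(500):
    lr = learning_rate(step * batch_size)
    theta, velocity = nesterov_step(theta, velocity, lambda x: x, lr)
```

On this toy problem the iterate converges to the minimum at zero
within a few hundred steps.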


7.8 Model averaging 

To further improve the prediction accuracy, we averaged the
predictions of several different models, and across several
transformations of the input images. Two requirements for model
averaging to be effective are that each individual model must
have roughly the same prediction accuracy, and that the
prediction errors should be as uncorrelated as possible.

For each model, we computed predictions for 60 affine
transformations of the input images: a combination of 10
rotations, spaced by 36°, 3 rescalings (with scale factors
1.2^-1, 1 and 1.2) and optional horizontal flipping. An
unweighted average of the predictions was computed. Even though
the model is trained to be robust to these types of deformations
(see Section 7.4), computing averaged predictions in this
fashion still helped to increase prediction accuracy (see
Table 3).

In total, 17 variants of the model were trained with predictions
computed from the mean across 60 transformations. This resulted
in 1020 sets of predictions averaged in total.
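
The counting behind these numbers, and the unweighted averaging
itself, can be written down in a few lines. The sketch below
enumerates the 60 test-time transformations and averages a list of
prediction arrays; the helper name average_predictions is
hypothetical.

```python
from itertools import product

import numpy as np

rotations = [36 * k for k in range(10)]   # 0, 36, ..., 324 degrees
scales = [1 / 1.2, 1.0, 1.2]
flips = [False, True]

# Every combination of rotation, rescaling and optional flip.
transformations = list(product(rotations, scales, flips))


def average_predictions(prediction_sets):
    """Unweighted mean over a list of equally shaped prediction arrays."""
    return np.mean(np.stack(prediction_sets), axis=0)


n_models = 17
n_prediction_sets = n_models * len(transformations)
```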


7.9 Implementation 

All aspects of the model were implemented using Python and the
Theano library (Bergstra et al. 2010; Bastien et al. 2012). This
allowed the use of GPU acceleration without any additional
effort. Theano is also able to perform automatic
differentiation, which simplifies the implementation of
gradient-based optimization techniques. Networks were trained on
NVIDIA GeForce GTX 680 cards. Data augmentation was performed on
the CPU using the scikit-image package (van der Walt et al.
2014) in parallel with model training on the GPU. Training the
network described in Section 7.6 took roughly 67 hours in real
time.

The code to reproduce the winning submission for the Galaxy
Challenge is available at
https://github.com/benanne/kaggle-galaxies.


8 RESULTS 

Competition results of the models are listed in Table 3. We
report the performance of our best performing network, with and
without averaging across 60 transformations, as well as that of
the combination of all 17 variants. The root-mean-square error
in Table 3 is the same metric used to score submissions in the
Galaxy Challenge (Equation 1). Both averaging across
transformations and averaging across different models
contributed significantly to the final score. It is worth noting
that our model performs well even without any model averaging,
which is important because fast inference is desirable for
practical applications. If predictions are to be generated for
millions of images, combining a large number of predictions for
each image would require an impractical amount of computation.

model                                  leaderboard score
                                       public     private
best performing network                0.07671    0.07693
+ averaging over 60 transformations    0.07579    0.07603
+ averaging over 17 networks           0.07467    0.07492

Table 3. Performance (in RMSE) of the best performing network,
as well as the performance after averaging across 60
transformations of the input, and across 17 variants of the
network. Please refer to Section 3 for details on how the scores
were computed.

Although morphology prediction was framed as a regression
problem in the competition (see Section 3), it is fundamentally
a classification task. To demonstrate the capabilities of our
model in a more interpretable fashion, we can look at
classification accuracies. For each question, we can obtain
classifications by selecting the answer with the highest
probability for each image. We can do this both for the
probabilities obtained from Galaxy Zoo participants, and for the
probabilities predicted by our model. We can then compute the
classification accuracy simply by counting the number of images
for which the classifications match up. Reducing the probability
distributions to classifications in this fashion clearly causes
some information to be discarded, but classification accuracy is
a metric that is much easier to interpret.

To find out how the level of agreement between the Galaxy Zoo
participants affects the accuracy of the predictions of our
model, we can compute the entropy of the probability
distribution over the answers for a given question. The entropy
of a discrete probability distribution p over n options
x_1, ..., x_n is given by:

\[ H(p) = -\sum_{i=1}^{n} p(x_i) \log p(x_i). \qquad (6) \]

If the entropy is minimal, all participants selected the same
answer (i.e. everyone agreed). If the entropy is maximal, all
answers were equally likely to be selected. The entropy ranges
between 0 and log(n). We can convert it into a measure of
agreement a(p) as follows:

\[ a(p) = 1 - \frac{H(p)}{\log(n)}. \qquad (7) \]

The quantity a(p) will equal 0 in case of maximal disagreement,
and 1 in case of maximal agreement.

To assess the conditions under which the predictions of the
model can be trusted, we can measure the confidence of a
prediction using the same measure a(p) by applying it to the
probability distributions predicted by the model, instead of the
distributions of the crowdsourced answers. This allows us to
relate model confidence and prediction accuracy.
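
The agreement measure of Equations 6 and 7 is straightforward to
compute. The sketch below uses the usual convention that terms
with zero probability contribute nothing to the entropy; the
function names are ours, and the input is assumed to be a
normalized distribution over at least two options.

```python
import numpy as np


def entropy(p):
    """Entropy H(p) of a discrete distribution (Equation 6)."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]  # convention: 0 * log(0) = 0
    return float(-np.sum(nonzero * np.log(nonzero)))


def agreement(p):
    """Agreement a(p) = 1 - H(p) / log(n) (Equation 7)."""
    n = len(p)
    return 1.0 - entropy(p) / float(np.log(n))


unanimous = agreement([1.0, 0.0, 0.0])        # everyone agrees
split = agreement([1 / 3, 1 / 3, 1 / 3])      # maximal disagreement
mostly = agreement([0.8, 0.1, 0.1])           # somewhere in between
```

The same function can be applied to the model's predicted
distribution for a question to obtain the confidence measure used
above.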

© 2014 RAS, MNRAS 000, 1-20

Convnets for galaxy morphology prediction 13

For each question, we selected the subset of images from the
real-time evaluation set* for which at least 50% of participants
answered the question. This is to ensure that we only consider
images for which the question is likely relevant. We ranked all
images in this subset according to the measure a(p) and divided
them into 10 equal bins. We did this separately for both the
crowdsourced answers and the model predictions. For each bin, we
computed the average of a(p) and the classification accuracy
using the best performing network (no averaging). These values
are visualized in a set of graphs for each question in Figure 9.
The red circles show classification accuracy versus agreement.
The blue squares show classification accuracy versus model
confidence. The classification accuracy across the entire subset
is also shown as a thick horizontal line. The dashed horizontal
lines indicate the maximal accuracy of 100% and the chance-level
accuracy, which depends on the number of options. The number of
images in each subset and the overall classification accuracy
are indicated above the graphs.

* We could also have conducted this analysis on the evaluation
set from the competition, but since the true answer probabilities
for the real-time evaluation set were readily available and this set
contains over 6,000 images, we used this instead.

For all questions, the classification accuracy tapers off as the
level of agreement between Galaxy Zoo participants decreases.
This makes sense, as those images are harder to classify.
Kuminski et al. (2014) report similar results using the
WND-CHARM algorithm, with lowest accuracies for features
describing spiral arm and irregular structures. Our model
achieves near-perfect accuracy for most of the questions when
the level of agreement is high. Classifications for bulge
dominance (Q5) and spiral arm tightness (Q10) have low agreement
overall, and are also more difficult to answer for the model.

Similarly, the confidence of the model in its predictions is
correlated with classification accuracy: we achieve near-perfect
accuracy for most questions when the model is highly confident.
This is a useful property, because it allows us to determine
when we are able to trust the predictions, and when we should
defer to an expert instead. As a consequence, the model could be
used to filter a large collection of images, in order to obtain
a much smaller set that can be annotated manually by experts.
Such a two-stage approach would greatly reduce the experts'
workload at virtually no cost in terms of accuracy.

For questions 1, 2, 3, 6 and 7 in particular, we are able to
make confident, accurate predictions for the majority of
examples. This would allow us to largely automate the assessment
of e.g. smoothness (Q1) and roundedness (Q7). For questions 5
and 10 on the other hand, confidence is low across the board and
the classification accuracy is usually too low to be of
practical use. As a result, determining bulge dominance (Q5) and
spiral arm tightness (Q10) would still require a lot of manual
input. The level to which we are able to automate the annotation
process depends on the morphological properties we are
interested in, as well as the distribution of morphology types
in the dataset we wish to analyse.

To assess how well the model is able to predict different
morphology types, we computed precision and recall scores for
all answers individually. The precision (P) and recall (R)
scores are defined in terms of the number of true positive (TP),
false positive (FP) and false negative (FN) classifications as
follows:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}. \qquad (8) \]

The scores are listed in Table 4. We used the same strategy as
before to obtain classifications, and only considered those
examples for which at least 50% of the Galaxy Zoo participants
answered the question. The numbers of examples that were
available for each question and answer are also shown.
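
The per-answer precision and recall scores can be computed as in
the sketch below, which treats one answer at a time as the
positive class. The function name and the small label lists are
illustrative only; a score is returned as None when its
denominator is zero, for instance when an answer is never
predicted.

```python
def precision_recall(true_labels, predicted_labels, answer):
    """Precision and recall for one answer of one question."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == answer and p == answer for t, p in pairs)
    fp = sum(t != answer and p == answer for t, p in pairs)
    fn = sum(t == answer and p != answer for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp > 0 else None
    recall = tp / (tp + fn) if tp + fn > 0 else None
    return precision, recall


# Toy classifications for five images and three possible answers.
consensus = [0, 0, 1, 1, 2]
predicted = [0, 1, 1, 1, 0]

scores = {a: precision_recall(consensus, predicted, a) for a in (0, 1, 2)}
```

In this toy example answer 2 is never predicted, so its precision
is undefined while its recall is zero.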

From these scores, we can establish that the model has more
difficulty with morphology types that occur less frequently in
the dataset, e.g., star or artifact (A1.3), no bulge (A5.1),
dominant bulge (A5.4) and dust lane (A8.7). We note that an
attempt is made to deliberately exclude images in the first
category from the Galaxy Zoo data set via flags in the SDSS
pipeline. Both the precision and recall scores are affected, so
this effect cannot be attributed entirely to a bias towards more
common morphologies. However, recall is generally affected more
strongly than precision, which indicates that the model is more
conservative in predicting rare morphology types. For a few very
rare answers, we were unable to compute precision scores because
the model never predicted them for the examples that were
considered: lens or arc (A8.2), boxy bulge (A9.2) and four
spiral arms (A11.4). While these are all rare morphologies, they
have considerable scientific interest and constructing a model
that can accurately identify them is still a primary goal.


9 ANALYSIS 

Traditionally, neural networks are often treated as black boxes
that perform some complicated and uninterpretable sequence of
computations that yield a good approximation to the desired
output. However, analysing the parameters of a trained model can
be very informative, and sometimes even leads to new insights
about the problem the network is trying to solve (Zeiler &
Fergus 2014). This is especially true for convolutional neural
networks trained on images, where the first-layer filters can be
interpreted visually.

Figure 10 shows the 32 filters learned in the first layer of
the best performing network described in Section 7.6. Each
filter was contrast-normalized individually to bring out the
details, and the three colour channels are shown separately.
Comparing the filter weights across colour channels reveals
that some filters are more sensitive to particular colours,
while others are sensitive to patterns, edges and textures.
The same phenomenon is observed when training convolutional
neural networks on more traditional image datasets.
The filters for edge detection seem to be looking for curved
edges in particular, which is to be expected because of the
radial symmetry of the input images.
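
The per-filter normalization used for such a visualization can be
sketched as follows. We assume here that contrast normalization
means a min-max rescaling of each filter over all of its channels;
the filter shape (32, 3, 6, 6) and the random weights are
placeholders for this example, not the learned parameters.

```python
import numpy as np

def contrast_normalize(filters, eps=1e-8):
    """Rescale each filter to [0, 1] independently for display.

    filters: array of shape (n_filters, n_channels, height, width).
    The minimum and maximum are taken over all channels of a filter.
    """
    filters = np.asarray(filters, dtype=np.float64)
    flat = filters.reshape(filters.shape[0], -1)
    lo = flat.min(axis=1).reshape(-1, 1, 1, 1)
    hi = flat.max(axis=1).reshape(-1, 1, 1, 1)
    return (filters - lo) / (hi - lo + eps)

# Placeholder weights (assumed shape), standing in for a trained layer.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=(32, 3, 6, 6))
images = contrast_normalize(weights)
```

Normalizing each filter separately makes low-amplitude filters
visible, at the cost of hiding differences in overall magnitude
between filters.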

Figure 9. Level of agreement (red circles) and model confidence (blue squares) versus classification accuracy for all questions (see
Table 1), computed on the real-time evaluation set. The overall classification accuracy is indicated as a thick horizontal line. The dotted
and dashed horizontal lines indicate the maximal accuracy of 100% and the chance-level accuracy respectively. The number of images
that were included in the analysis and the overall classification accuracy for each question are indicated above the graphs.

Figures 11 and 12 show how an input viewpoint (i.e. a
45 × 45 part of an input image, see Section 7.5) activates
the units in the convolutional part of the network. Note that
the geometry of the input image is still apparent in the
activations of higher convolutional layers. The activations of all
layers except the third are also quite sparse, especially those
of the fourth layer. One possible reason why the third-layer
activations are not as sparse is that there is no pooling
layer directly following it.
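
Sparsity of this kind can be quantified as the fraction of units
that are inactive for a given input. The sketch below assumes
rectified activations, so that inactive units are exactly zero;
the function name and the synthetic feature maps are placeholders
introduced for this example.

```python
import numpy as np

def activation_sparsity(feature_maps, tol=0.0):
    """Fraction of units whose activation is at most `tol`."""
    feature_maps = np.asarray(feature_maps)
    return float(np.mean(feature_maps <= tol))

# Synthetic stand-in for one layer's output: 64 feature maps of 20 x 20.
rng = np.random.default_rng(0)
pre_activations = rng.normal(loc=-1.0, size=(64, 20, 20))
rectified = np.maximum(pre_activations, 0.0)  # assumed rectification
sparsity = activation_sparsity(rectified)
```

Computing this value per layer for a fixed input viewpoint gives a
single number that can be compared across layers.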

It is also possible to visualize what neurons in the topmost
hidden layer of the network (i.e. just before the output
layer) have learned about the data, by selecting representative
examples from the test set that maximize their activations.
This reveals what type of inputs the unit is sensitive
to, and what kind of invariances it has learned. Because we
used maxout units in this layer, we can also select examples
that minimally activate the units, allowing us to determine
which types of inputs each unit discriminates between.
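
Selecting such examples amounts to sorting the test set by the
activation of one unit. A minimal sketch, assuming the activations
of the topmost hidden layer have already been collected in an array
of shape (number of examples, number of units); the function name
and the toy values are hypothetical:

```python
import numpy as np

def extreme_examples(activations, unit, k=6):
    """Indices of the k examples that maximally and minimally
    activate one unit.

    activations: array of shape (n_examples, n_units).
    Returns (highest, lowest), each ordered from most to least extreme.
    """
    order = np.argsort(activations[:, unit])
    lowest = order[:k]
    highest = order[::-1][:k]
    return highest, lowest

# Toy activations for 4 examples and 2 units (hypothetical values).
acts = np.array([[0.1, 5.0],
                 [2.0, -1.0],
                 [-3.0, 0.0],
                 [0.5, 4.0]])
highest, lowest = extreme_examples(acts, unit=0, k=2)
```

With k = 6, the two returned index lists correspond to the two rows
of 6 images shown for each unit in Figure 13.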

Figure 10. The 32 filters learned in the first convolutional layer of the best-performing network. Each filter was contrast-normalized
individually across all channels.

Figure 13 shows such a visualization for three different
units. Clearly each unit is able to discriminate between two
distinct types of galaxies. The units also exhibit rotation
invariance, as well as some scale invariance. For some units,
we observed selectivity only in the positive or in the negative
direction (not shown). A minority of units seem to be
multimodal, activating in the same direction for two or more
distinct types of galaxies. Presumably the activation value
of these units is disambiguated in the context of all other
unit values.

The unit visualized in Figure 13b detects imaging artifacts:
black lines running across the centre of the images,
which are the result of dead pixels in the SDSS camera.
This is interesting because such (known) artifacts are not
morphological features of the depicted galaxies. It turns out
that the network is trying to replicate the behaviour of the
Galaxy Zoo participants, who tend to classify images
featuring such artifacts as disturbed galaxies (answer A8.3 in
Table 1), even though this is not the intended meaning of
this answer. Most likely this is because the button for this
answer in the Galaxy Zoo 2 web interface seems to feature
such a black line.

Finally, we can look at some examples from the real-time
evaluation set (see Section 7.1) with low and high
prediction errors, to get an idea of the strengths and weaknesses
of the model (Figure 14). The reported RMSE values were
obtained with the best performing network, without any
averaging and without centering or rescaling.
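
Ranking images by prediction error only requires a per-image RMSE
over the predicted answer probabilities. The following sketch uses
made-up predictions and targets; the function name and the array
layout (one row per image, one column per answer) are assumptions
for this example.

```python
import numpy as np

def per_image_rmse(predictions, targets):
    """Root-mean-square error of each row (image) over all answers."""
    diff = np.asarray(predictions) - np.asarray(targets)
    return np.sqrt(np.mean(diff ** 2, axis=1))

# Hypothetical targets and predictions for 3 images and 2 answers.
targets = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
predictions = np.array([[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]])

rmse = per_image_rmse(predictions, targets)
ranking = np.argsort(rmse)  # easiest images first, hardest last
```

The first and last entries of the ranking identify the easiest and
the most difficult images, respectively.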

The images that are difficult to classify are quite varied.
Some are faint, but look fairly typical otherwise, such
as Figure 14a. Most are negatively affected by the cropping
operation in various ways: either because they are not
properly centred, or because they are very large (Figures 14b
and 14c respectively). This was the original motivation for
introducing an additional rescaling and centering step during
preprocessing, but this step did not end up improving the
overall prediction accuracy. The easiest galaxies to classify
are mostly smooth, round ellipticals.


10 CONCLUSION AND FUTURE WORK 

We present a convolutional neural network for fine-grained
galaxy morphology prediction, with a novel architecture
that allows us to exploit rotational symmetry in the
input images. The network was trained on data from the
Galaxy Zoo 2 project and is able to reliably predict
various aspects of galaxy morphology directly from raw pixel
data, without requiring any form of handcrafted feature
extraction. It can automatically annotate large collections of
images, enabling quantitative studies of galaxy morphology
on an unprecedented scale.

Our novel approach to exploiting rotational symmetry
was essential to achieve state-of-the-art performance,
winning the Galaxy Challenge hosted on Kaggle. Although our
winning solution required averaging many sets of predictions
from different networks for each image, using a single
network also yields competitive results.
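
To make the role of rotational symmetry concrete, the sketch below
averages the outputs of a model over the 8 rotated and flipped
copies of a square image. This is a generic illustration of
symmetry-based averaging, not the architecture described in this
paper (see Section 7.5 for the viewpoint extraction that is actually
used); the helper names and the toy model are hypothetical.

```python
import numpy as np

def dihedral_views(image):
    """Return the 8 rotated/flipped copies of a square image.

    image: array of shape (H, W) or (H, W, C) with H == W.
    """
    views = []
    for flipped in (image, np.fliplr(image)):
        for quarter_turns in range(4):
            views.append(np.rot90(flipped, k=quarter_turns, axes=(0, 1)))
    return views

def averaged_prediction(model, image):
    """Average the model outputs over all 8 views of the image."""
    return np.mean([model(v) for v in dihedral_views(image)], axis=0)

def toy_model(img):
    """Stand-in model whose output depends on the image orientation."""
    return np.array([img[0, 0], img.mean()])

image = np.arange(16, dtype=float).reshape(4, 4)
out = averaged_prediction(toy_model, image)
rotated_out = averaged_prediction(toy_model, np.rot90(image))
```

Because the set of 8 views is the same for an image and for its
rotated copy, the averaged output does not change when the input is
rotated by a multiple of 90 degrees.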

Our model can be adapted to work with any collection of
centered galaxy images and arbitrary morphological decision
trees. Our implementation was developed using open source
tools and the source code is publicly available. The model
can be trained and used on consumer hardware. Its
predictions are highly reliable when they are confident, making
our approach applicable for fine-grained morphological
analysis of large-scale survey data. Performing such large-scale
analyses is an important direction for future research.
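
A workflow that relies on confident predictions could be sketched
as follows. We assume here, purely for illustration, that confidence
is measured by the largest predicted answer probability and that a
fixed threshold of 0.9 is used; neither choice is taken from this
paper, and the probabilities are made up.

```python
import numpy as np

def split_by_confidence(probabilities, threshold=0.9):
    """Label each image and flag whether the prediction is confident.

    probabilities: array of shape (n_images, n_answers), rows sum to 1.
    Confidence is taken to be the largest probability (an assumption).
    """
    probabilities = np.asarray(probabilities)
    labels = probabilities.argmax(axis=1)
    confident = probabilities.max(axis=1) >= threshold
    return labels, confident

# Hypothetical predictions for 3 images and one 3-answer question.
probs = np.array([[0.97, 0.02, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.93, 0.02]])
labels, confident = split_by_confidence(probs)
auto_labelled = np.flatnonzero(confident)   # annotated by the model
for_experts = np.flatnonzero(~confident)    # forwarded for manual annotation
```

The threshold controls the trade-off between the number of images
that still require manual annotation and the accuracy of the
automatically assigned labels.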

For future work, we would like to train networks on
larger collections of annotated images. From previous
applications in the domain of computer vision, it has become
clear that the performance of convolutional neural networks
scales very well with the size of the dataset. The ~55,000
galaxy images used in this paper (90% of the provided
training set) constitute quite a small dataset by modern
standards. Even though we combined several techniques to avoid
overfitting, which allowed us to train very large models on
this dataset effectively, a clear opportunity to improve
predictive performance is to train the same model on a larger
dataset, since Galaxy Zoo has already collected annotations for a
much larger number of images. More recent iterations of
the Galaxy Zoo project have concentrated on higher-redshift
samples, so care will have to be taken to ensure that
the model is able to generalize across different redshift slices.

The use of larger datasets may also allow for a further
increase in model capacity (i.e. the number of trainable
parameters) without the risk of excessive overfitting. These
high-capacity models could be used as the basis for much
larger surveys such as the LSST. The integration of model
predictions into existing annotation workflows, both by
experts and through crowdsourcing platforms, will also require
further study.

Figure 11. Activations of each layer in the convolutional part of the best performing network, given the input viewpoint shown in the
top left. The number of feature maps and the size of each map is indicated below each figure. The geometry of the input image is still
apparent in the activations of higher convolutional layers. The activations of all layers except the third are also quite sparse.

Figure 12. Same as Figure 11, but for a different input viewpoint in the upper left.

Figure 13. Example images from the test set that maximally and minimally activate units in the topmost hidden layer of the best
performing network. Each group of 12 images represents a unit. The top row of images in each group maximally activate the unit, and
the bottom row of images minimally activate it. From top to bottom, these galaxies primarily correspond to the Galaxy Zoo 2 labels of: loose
winding arms, edge-on disks, irregulars, disturbed, other, and tight winding arms.

Another possibility is the application of our approach to
raw photometric data which have not been preprocessed for
visual inspection by humans. The networks should be able
to learn useful features from this representation, including
structural changes from multiple wavebands (e.g. Häußler
et al. 2013). Automated classification of other data
modalities that exhibit radial symmetry (a commonly occurring
property in nature, e.g. in flowers, animals) also presents an
interesting opportunity.


From a machine learning point of view, we would like to 
investigate improved network architectures based on recent 
developments, such as the trend towards deeper networks 
with in excess of 20 layers of processing and the use of smaller 
receptive fields (Szegedy et al. 2014; Simonyan & Zisserman 
2014). 
(a) 0.24610 (b) 0.21877 (c) 0.21145 (d) 0.20088
(e) 0.01112 (f) 0.01174 (g) 0.01187 (h) 0.01223

Figure 14. Example images from the real-time evaluation set, along with their prediction RMSEs for the best-performing network. The
images on the top row were the most difficult for the model to classify; the images on the bottom row were the easiest. Larger angular
size and non-radially symmetric morphology are the most challenging targets for the model.

Table 4. Precision and recall scores for each answer. We
compute these values only for the subset of examples in the real-time
evaluation set where at least 50% of participants answered the
question. We also give the number of examples that are in this
subset for each answer. A question mark indicates that we were
unable to compute the precision score because the model did not
predict this answer for any of the considered examples.